Day 2 - Practical 2 - more advanced methods for metabolomics data analysis
2024 NUTRIM microbiome & metabolome workshop
1 Intro 🔆
In this tutorial, we are going to explore some more advanced methods and how to implement them in R:
2 Load R packages and some functions 📦
library(here)
library(tidyverse) # data cleaning
library(ggplot2) # visualizations
source(here("functions.R")) # URF functions
library(randomForest)
library(factoextra)
library(limpca) # ASCA
3 Read the data 🔍
We are going to use the same data as before, so you should already have it stored in your environment. However, if you restarted the R session, you can use the code below to read the data again:
# Read metabolites data table
metabolites <- read.csv(file = here("data:papa2012/processed/metabolites_short.csv"), row.names = 1)
# PQN normalize it
metabolites_pqn <- pqn(metabolites)
# Read metadata
metadata_metabolites <- read.csv(file = here("data:papa2012/processed/metadata_metabolites.csv"), row.names = 1)
4 URF 🌲🌲🌲
We have provided a function to perform URF in R:
script_URF(nr_itteration, real_data, nr_trees, nr_samples)
TASK 1
Look at the code of the URF function below. You don’t need to understand or analyze the body of the function. Just look at the comments and try to analyze the inputs and outputs.
What are the input arguments? What are the outcome(s) of the function? Tip: Look at what’s inside the return().
script_URF <- function(nr_itteration, real_data, nr_trees, nr_samples) {
  # Input arguments:
  # nr_itteration --> number of times the procedure should be repeated, e.g. 50 or 100
  # real_data     --> data set (samples in rows and variables in columns)
  # nr_trees      --> number of trees in the RF; 1500 or more
  # nr_samples    --> number of samples in a terminal node; the default is 1, but a higher number such as 5 or 8 is better
  # Outputs:
  # pc             --> PCA scores (for the score plot)
  # pr             --> % of variance per PC
  # mean_proximity --> mean proximity matrix
  require(randomForest)
  require(factoextra)
  proximity_all <- array(NA, dim = c(nrow(real_data), nrow(real_data), nr_itteration))
  for (i in 1:nr_itteration) {
    cat(i, "\n")
    result <- unsupervisedRF_proximityadd(real_data, nr_trees, nr_samples)
    proximity_all[, , i] <- result$proximity_realdata
  }
  # PCA on the mean value of proximity
  mean_proximity <- apply(proximity_all, c(1, 2), mean)
  pc <- prcomp(doublecentering(1 - mean_proximity),
               center = FALSE,
               scale = FALSE)
  pc.eigval <- get_eigenvalue(pc)
  return(list(pc = pc,
              pr = pc.eigval,
              mean_proximity = mean_proximity))
}
TASK 2
Let’s perform our first URF! Call the function with 1500 trees and 8 samples in a terminal node. Limit the number of iterations to 10, which will speed up the computations (generally we would use around 100 iterations). Assign the results to an object named URF.
URF is a non-linear method, so it performs well with non-scaled data.
# write your code in the notebook
URF <- script_URF(nr_itteration = 10, real_data = metabolites_pqn, nr_trees = 1500, nr_samples = 8)
TASK 3
Look at the structure of the resulting URF model:
str(URF)
List of 3
$ pc :List of 5
..$ sdev : num [1:241] 2.362 0.289 0.264 0.208 0.171 ...
..$ rotation: num [1:241, 1:241] 0.0478 0.0479 0.0566 0.0278 0.0451 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : NULL
.. .. ..$ : chr [1:241] "PC1" "PC2" "PC3" "PC4" ...
..$ center : logi FALSE
..$ scale : logi FALSE
..$ x : num [1:241, 1:241] -1.75 -1.75 -2.07 -1.02 -1.65 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : NULL
.. .. ..$ : chr [1:241] "PC1" "PC2" "PC3" "PC4" ...
..- attr(*, "class")= chr "prcomp"
$ pr :'data.frame': 241 obs. of 3 variables:
..$ eigenvalue : num [1:241] 5.581 0.0833 0.0695 0.0434 0.0291 ...
..$ variance.percent : num [1:241] 87.617 1.307 1.091 0.681 0.457 ...
..$ cumulative.variance.percent: num [1:241] 87.6 88.9 90 90.7 91.2 ...
$ mean_proximity: num [1:241, 1:241] 1 0.467 0.367 0.282 0.292 ...
TASK 4
Plot the URF score plot. Color the points by Diagnosis. What do you observe?
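As a hedged, self-contained sketch of what such a score plot involves, the snippet below uses R’s built-in iris data as a stand-in for the metabolites; in the notebook you would instead plot URF$pc$x and color by Diagnosis from metadata_metabolites. Note that script_URF additionally averages proximities over iterations and double-centers before the PCA, which this simplified sketch skips.

```r
library(randomForest)

set.seed(1)
# Unsupervised random forest: omitting y makes randomForest run in
# unsupervised mode and return a sample-by-sample proximity matrix
urf <- randomForest(iris[, 1:4], ntree = 500, proximity = TRUE)

# PCA on the dissimilarity (1 - proximity), simplified vs. script_URF
pc <- prcomp(1 - urf$proximity)
scores <- data.frame(pc$x[, 1:2], Group = iris$Species)

# A ggplot2 score plot colored by group would then be:
# ggplot(scores, aes(PC1, PC2, color = Group)) + geom_point()
```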
5 ASCA
There are many versions of ASCA, depending on the application and data design. Today, we will familiarize ourselves with just one version: ASCA+ from the R package limpca (Thiel et al. 2023).
More information about the package, as well as tutorials and examples with datasets can be found here.
5.1 ASCA - case study
We are going to inspect the effect of the disease (nonIBD, UC, CD) and the patient age group (minors vs adults) on the metabolites’ profiles. We suspect that the effect of the disease on the metabolites differs with age (between minors and adults).
In our data, we have multiple samples per participant, because patients were sampled at different time points. However, we are not interested in this factor, as it doesn’t follow any design.
Therefore, we have provided a new file metadata_metabolites_asca.csv with the metadata of only selected samples from one time point (one sample per patient).
TASK 1
Read metadata_metabolites_asca.csv. How many samples do we have now?
# write your code in the notebook
metadata_metabolites_asca <- read.csv(file = here("data:papa2012/asca/metadata_metabolites_asca.csv"), row.names = 1)
nrow(metadata_metabolites_asca)
[1] 66
TASK 2
We also need to select those samples from the metabolites data table (PQN normalized).
metabolites_asca <- metabolites_pqn[which(rownames(metabolites_pqn) %in% metadata_metabolites_asca$SampleID), ]
TASK 3
In the metadata, we only have the variable Age, but no age groups. However, we can stratify our patients based on their age. Let’s create a vector Age_group by stratifying the Age variable into minors (below 18) and adults, and add it to metadata_metabolites_asca. Tip: use the ifelse() function.
# write your code in the notebook
Age_group <- ifelse(metadata_metabolites_asca$Age < 18, "minor", "adult")
metadata_metabolites_asca$Age_group <- Age_group
TASK 4
Let’s see if our data is balanced now. Tip: use table() and provide both Age_group and Diagnosis from our metadata as arguments.
# write your code in the notebook
SOLUTION:
table(Age_group, metadata_metabolites_asca$Diagnosis)
Age_group CD nonIBD UC
adult 12 11 13
minor 11 8 11
5.2 Create an ASCA object
In order to use ASCA from the limpca package, the data need to be formatted as a list with the following elements:
outcomes (multivariate matrix with the responses - our metabolites data),
design (data.frame - our metadata),
formula (character string with factors and interactions).
First, create an empty list:
ASCA_model <- list()
Outcomes
Outcomes should contain our metabolites data. Append outcomes to the list. We need to log-transform our data, since ASCA uses linear models, which assume normally distributed residuals:
ASCA_model$outcomes <- as.matrix(log(metabolites_asca + 1))
Design
Design takes the form of a data frame with the values of our factors. Make sure to transform the variables into factors:
ASCA_model$design <- metadata_metabolites_asca %>%
dplyr::select(c("Diagnosis", "Age_group")) %>%
dplyr::mutate(Diagnosis = factor(Diagnosis, levels = c("nonIBD", "CD", "UC")),
Age_group = factor(Age_group, levels = c("minor", "adult")))
# Row names of the design matrix need to be the same as of the outcomes (sample names)
rownames(ASCA_model$design) <- metadata_metabolites_asca$SampleID
We can use the function plotDesign() to plot the design matrix:
plotDesign(design = ASCA_model$design, x = "Diagnosis", y = "Age_group",
           title = "Design of the metabolomics dataset")
Looks familiar? We have already inspected the design matrix using table() above. And our design has a good balance.
Formula
In the formula, we need to specify all factors of interest and the interactions between them:
According to ChatGPT, “An interaction term in a linear model represents the combined effect of two or more predictor variables on the response variable, beyond their individual effects. It allows the model to account for situations where the effect of one predictor on the response depends on the level of another predictor.”
What does it really mean? Let’s think of a simple example: we want to predict the result of an exam based on study hours and use of a study guide. Without an interaction, the prediction is simply the sum of the contributions of study hours and of using the study guide. But the value of each study hour may depend on whether we use the study guide: with the guide, we might study less yet still get a higher score on the exam.
To sum up, an interaction allows the effect of one predictor to change depending on the level of another predictor, depicting a more complex relationship between the predictors and the outcome.
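The study-guide example can be made concrete with a small simulation (hypothetical numbers), where each study hour is worth more when the guide is used:

```r
set.seed(42)
n     <- 200
hours <- runif(n, 0, 10)       # hours studied
guide <- rbinom(n, 1, 0.5)     # 1 = used the study guide
# True model: an hour is worth 3 points alone, but 3 + 2 = 5 points with the guide
score <- 40 + 3 * hours + 5 * guide + 2 * hours * guide + rnorm(n, sd = 3)

fit <- lm(score ~ hours * guide)  # expands to hours + guide + hours:guide
coef(fit)["hours:guide"]          # estimate close to the true interaction of 2
```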
ASCA_model$formula <- "outcomes ~ Diagnosis + Age_group + Diagnosis:Age_group"
5.3 Perform ASCA
5.3.1 Model matrix generation
The first step of ASCA+ is to encode our design matrix as a model matrix to build linear models. Each factor and interaction is encoded with multiple binary variables:
resLmpModelMatrix <- lmpModelMatrix(ASCA_model)
Print the model matrix:
head(resLmpModelMatrix$modelMatrix)
         (Intercept) Diagnosis1 Diagnosis2 Age_group1 Diagnosis1:Age_group1
CSM5FZ3T 1 0 1 -1 0
CSM5FZ48 1 -1 -1 -1 1
CSM5FZ4A 1 -1 -1 -1 1
CSM5FZ4O 1 -1 -1 -1 1
CSM5MCTZ 1 -1 -1 -1 1
CSM5MCXF 1 0 1 -1 0
Diagnosis2:Age_group1
CSM5FZ3T -1
CSM5FZ48 1
CSM5FZ4A 1
CSM5FZ4O 1
CSM5MCTZ 1
CSM5MCXF -1
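The -1/0/1 pattern above corresponds to sum (deviation) coding. As an illustration only (limpca handles this internally), the same kind of encoding can be sketched in base R on a hypothetical balanced design:

```r
# Hypothetical balanced two-factor design
design <- expand.grid(Diagnosis = factor(c("nonIBD", "CD", "UC")),
                      Age_group = factor(c("minor", "adult")))

# Sum coding: a k-level factor is encoded with k - 1 columns of -1/0/1
mm <- model.matrix(~ Diagnosis * Age_group, data = design,
                   contrasts.arg = list(Diagnosis = "contr.sum",
                                        Age_group = "contr.sum"))
colnames(mm)  # intercept, 2 Diagnosis columns, 1 Age_group column, 2 interaction columns
```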
5.3.2 Decomposition of effect matrices
The next step is to decompose the model matrix with linear models for every factor or interaction separately:
resLmpEffectMatrices <- lmpEffectMatrices(resLmpModelMatrix)
This creates so-called effect matrices for every factor and interaction.
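Conceptually (this is a toy sketch, not limpca’s code), an effect matrix is the part of the fitted values attributable to one factor’s coded columns:

```r
set.seed(7)
# Hypothetical balanced design with two factors and 4 "metabolites"
design <- expand.grid(A = factor(c("a1", "a2")), B = factor(c("b1", "b2", "b3")))
Y  <- matrix(rnorm(nrow(design) * 4), ncol = 4)
mm <- model.matrix(~ A * B, data = design,
                   contrasts.arg = list(A = "contr.sum", B = "contr.sum"))

# OLS coefficients of the multivariate linear model
B_hat <- solve(t(mm) %*% mm, t(mm) %*% Y)

# Effect matrix for factor A: only A's coded columns times their coefficients
cols_A   <- colnames(mm) == "A1"
effect_A <- mm[, cols_A, drop = FALSE] %*% B_hat[cols_A, , drop = FALSE]
dim(effect_A)  # one row per sample, one column per metabolite
```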
5.3.3 ASCA
Now, we can apply PCA simultaneously to all effect matrices (hence SCA: simultaneous component analysis).
We can also combine the factors and interactions (by linear combinations) and apply PCA to the combined effect matrix, e.g. Diagnosis + Age_group + Diagnosis:Age_group; however, the interpretation then gets more difficult.
Perform PCA of each of the factors/interactions:
resASCA <- lmpPcaEffects(resLmpEffectMatrices = resLmpEffectMatrices,
                         method = "ASCA")
5.3.4 Contributions
Contribution to the total variance
From ASCA we obtain scores and loadings for each factor and interaction. We can display the total contribution of each factor and interaction to the total variance in our data (in %):
resLmpContributions <- lmpContributions(resASCA)
resLmpContributions$totalContribTable
                    Percentage of Variance
Diagnosis 24.30
Age_group 2.20
Diagnosis:Age_group 2.57
Residuals 70.83
The values should add up to 100%.
PCs contributions in each model
We can also display the % of variance explained by each PC in each PCA model for the factors and interactions:
resLmpContributions$effectTable
                      PC1   PC2 PC3  PC4  PC5    Sum
Diagnosis 93.35 6.65 0.0 0.00 0.00 100.00
Age_group 100.00 0.00 0.0 0.00 0.00 100.00
Diagnosis:Age_group 55.62 44.38 0.0 0.00 0.00 100.00
Residuals 22.09 9.03 4.9 4.66 3.95 44.63
Here, the PCs within each factor/interaction sum up to 100% for that PCA model. Note that the residuals have more PCs, but we only display the first 5, which is why their row does not reach 100%.
5.3.5 Statistical testing
Okay, we’ve got the total contribution of each factor/interaction. Based on that, we see that Diagnosis contributes 24.3% of the variance in the metabolites’ profiles. But how can we know if this effect is significant? Here, we use bootstrapping with a large number of iterations, which gives us p-values.
To test the significance of the model effects, we run the bootstrap with a reduced number of iterations (e.g. 100), which reduces the computation time:
resLmpBootstrapTests <- lmpBootstrapTests(resLmpEffectMatrices = resLmpEffectMatrices,
                                          nboot = 100)
Display p-values:
t(resLmpBootstrapTests$resultsTable)
                      Diagnosis Age_group Diagnosis:Age_group Residuals
% of variance (T III) "24.30" " 2.20" " 2.57" "70.83"
Bootstrap p-values "< 0.01" "0.02" "0.3" "-"
This way, we display the total contribution of each factor/interaction and, additionally, their p-values.
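As a generic sketch of the resampling logic behind such tests (a toy permutation test on simulated data; limpca’s bootstrap differs in its details), the idea is to re-fit with shuffled labels and see how often chance explains as much variance as the observed effect:

```r
set.seed(123)
# Toy data: one response, one two-level factor with a real group difference
y <- c(rnorm(15, mean = 0), rnorm(15, mean = 1.5))
g <- factor(rep(c("a", "b"), each = 15))

obs  <- summary(lm(y ~ g))$r.squared  # observed share of variance explained
# Re-fit with permuted labels to build the null distribution
perm <- replicate(999, summary(lm(y ~ sample(g)))$r.squared)
p_value <- (1 + sum(perm >= obs)) / (1 + length(perm))  # small -> significant
```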
5.3.6 Scores and loadings plots
Each PCA model for a factor/interaction, or a PCA model for the combined effect, results in scores and loadings. That’s what makes ASCA a great tool: it provides interpretability of the results.
Diagnosis was significant in the bootstrapping, but let’s also visualize a score plot for this factor:
And a loadings plot for the first PC:
lmpLoading1dPlot(resASCA, effectNames = "Diagnosis",
                 axes = 1, # only PC1
                 xaxis_type = "character")
For the Diagnosis effect, the score plot shows a clear separation between the different levels on the first PC. The distinct separation of the groups suggests a strong effect of this factor, which is also confirmed by its significance.
We can also visualize the scores and loadings plots for the interaction Diagnosis:Age_group:
lmpScorePlot(resASCA, effectNames = "Diagnosis:Age_group",
             color = "Diagnosis", shape = "Age_group")
Based on the scores plot, the interaction seems important, but it is not significant in the bootstrapping.
Interactions can be further visualized with line plots:
lmpEffectPlot(resASCA, effectName = "Diagnosis:Age_group", x = "Diagnosis",
              z = "Age_group")
lmpEffectPlot(resASCA, effectName = "Diagnosis:Age_group", x = "Age_group",
              z = "Diagnosis")
Based on the interaction plots, the UC and CD groups are similar across age groups, but the nonIBD patients differ between minors and adults.
6 That’s all folks!
Now you understand URF and ASCA, and you should be able to apply them in R to your own data!
7 Session info
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.4.1 (2024-06-14)
os macOS Sonoma 14.2.1
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Paris
date 2024-06-30
pandoc 3.1.11 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
abind 1.4-5 2016-07-21 [1] CRAN (R 4.4.0)
Biobase 2.64.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
BiocGenerics 0.50.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.0)
codetools 0.2-20 2024-03-31 [1] CRAN (R 4.4.1)
colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.4.0)
crayon 1.5.3 2024-06-20 [1] CRAN (R 4.4.0)
DelayedArray 0.30.1 2024-05-30 [1] Bioconductor 3.19 (R 4.4.0)
digest 0.6.36 2024-06-23 [1] CRAN (R 4.4.0)
doParallel 1.0.17 2022-02-07 [1] CRAN (R 4.4.0)
dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)
evaluate 0.24.0 2024-06-10 [1] CRAN (R 4.4.0)
factoextra * 1.0.7 2020-04-01 [1] CRAN (R 4.4.0)
fansi 1.0.6 2023-12-08 [1] CRAN (R 4.4.0)
farver 2.1.2 2024-05-13 [1] CRAN (R 4.4.0)
fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)
forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.4.0)
foreach 1.5.2 2022-02-02 [1] CRAN (R 4.4.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.4.0)
GenomeInfoDb 1.40.1 2024-06-16 [1] Bioconductor 3.19 (R 4.4.0)
GenomeInfoDbData 1.2.12 2024-06-29 [1] Bioconductor
GenomicRanges 1.56.1 2024-06-16 [1] Bioconductor 3.19 (R 4.4.0)
ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.4.0)
ggrepel 0.9.5 2024-01-10 [1] CRAN (R 4.4.0)
ggsci 3.2.0 2024-06-18 [1] CRAN (R 4.4.0)
glue 1.7.0 2024-01-09 [1] CRAN (R 4.4.0)
gtable 0.3.5 2024-04-22 [1] CRAN (R 4.4.0)
here * 1.0.1 2020-12-13 [1] CRAN (R 4.4.0)
hms 1.1.3 2023-03-21 [1] CRAN (R 4.4.0)
htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.0)
httr 1.4.7 2023-08-15 [1] CRAN (R 4.4.0)
IRanges 2.38.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
iterators 1.0.14 2022-02-05 [1] CRAN (R 4.4.0)
jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.4.0)
knitr 1.47 2024-05-29 [1] CRAN (R 4.4.0)
labeling 0.4.3 2023-08-29 [1] CRAN (R 4.4.0)
lattice 0.22-6 2024-03-20 [1] CRAN (R 4.4.1)
lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0)
limpca * 1.0.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.4.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.0)
Matrix 1.7-0 2024-04-26 [1] CRAN (R 4.4.1)
MatrixGenerics 1.16.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
matrixStats 1.3.0 2024-04-11 [1] CRAN (R 4.4.0)
munsell 0.5.1 2024-04-01 [1] CRAN (R 4.4.0)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.4.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0)
plyr 1.8.9 2023-10-02 [1] CRAN (R 4.4.0)
purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.4.0)
randomForest * 4.7-1.1 2022-05-23 [1] CRAN (R 4.4.0)
Rcpp 1.0.12 2024-01-09 [1] CRAN (R 4.4.0)
readr * 2.1.5 2024-01-10 [1] CRAN (R 4.4.0)
reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.4.0)
rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)
rmarkdown 2.27 2024-05-17 [1] CRAN (R 4.4.0)
rprojroot 2.0.4 2023-11-05 [1] CRAN (R 4.4.0)
rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)
S4Arrays 1.4.1 2024-05-30 [1] Bioconductor 3.19 (R 4.4.0)
S4Vectors 0.42.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
scales 1.3.0 2023-11-28 [1] CRAN (R 4.4.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)
SparseArray 1.4.8 2024-05-30 [1] Bioconductor 3.19 (R 4.4.0)
stringi 1.8.4 2024-05-06 [1] CRAN (R 4.4.0)
stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.4.0)
SummarizedExperiment 1.34.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)
tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.4.0)
tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0)
tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.4.0)
timechange 0.3.0 2024-01-18 [1] CRAN (R 4.4.0)
tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.4.0)
UCSC.utils 1.0.0 2024-05-06 [1] Bioconductor 3.19 (R 4.4.0)
utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.0)
vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0)
withr 3.0.0 2024-01-16 [1] CRAN (R 4.4.0)
xfun 0.45 2024-06-16 [1] CRAN (R 4.4.0)
XVector 0.44.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
yaml 2.3.8 2023-12-11 [1] CRAN (R 4.4.0)
zlibbioc 1.50.0 2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
[1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
──────────────────────────────────────────────────────────────────────────────